Quarto is a powerful typesetting tool. Instead of writing the report or presentation in Word, PowerPoint, or a Google Doc; running code in an R script; saving output and visualizations; and pasting them in the report or the presentation – we can do it all in Quarto.
The header at the top of a Quarto document coded in YAML (i.e., Yet Another Markup Language).
--- title: "Exercise 5" author: "Marc Dotson" format: docx ---
While Quarto can produce HTML documents, PowerPoint slides, slides (like these), GitHub documents, blogs, webpages, etc. – unless specified otherwise, all subsequent assignments will be submitted as Word documents rendered from a Quarto document.
## headings and ### sub-headings to clearly identify sections.‘‘‘{r} and end with ‘‘‘ and should include comments but not text.functions and data with `s, italics with *s, and bold with **s.library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.4 ✔ readr 2.1.5 ## ✔ forcats 1.0.0 ✔ stringr 1.5.1 ## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
crm_data <- read_csv("customer_data.csv") |>
left_join(read_csv("store_transactions.csv"), join_by(customer_id))
## Rows: 10531 Columns: 14 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (8): gender, married, college_degree, region, state, review_time, review... ## dbl (6): customer_id, birth_year, income, credit, review_id, star_rating ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ## Rows: 10531 Columns: 169 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## dbl (169): customer_id, jan_2005, feb_2005, mar_2005, apr_2005, may_2005, ju... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How can we summarize the transactions for 2018 by region?
crm_data |>
select(region, customer_id, contains("2018"))
## # A tibble: 10,531 × 14 ## region customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 South 1001 1 0 0 0 0 4 ## 2 West 1002 0 0 0 0 0 0 ## 3 South 1003 4 0 0 0 0 0 ## 4 Midwest 1004 0 0 0 2 0 0 ## 5 West 1005 0 1 0 0 0 0 ## 6 Midwest 1006 0 0 0 0 0 0 ## 7 Midwest 1007 0 0 0 4 0 0 ## 8 South 1008 0 0 0 0 0 0 ## 9 West 1009 0 0 0 0 1 0 ## 10 Northeast 1010 0 0 0 0 0 0 ## # ℹ 10,521 more rows ## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>, ## # oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>
Tidy data is defined as follows:
This may seem obvious or simple, but this common philosophy is at the heart of the tidyverse. It also means we will often prefer longer datasets to wider datasets and {tidyr} will help us move between the two.
The most common identifying feature of messy data is when column names are really values.
crm_data |>
select(region, customer_id, contains("2018"))
## # A tibble: 10,531 × 14 ## region customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 South 1001 1 0 0 0 0 4 ## 2 West 1002 0 0 0 0 0 0 ## 3 South 1003 4 0 0 0 0 0 ## 4 Midwest 1004 0 0 0 2 0 0 ## 5 West 1005 0 1 0 0 0 0 ## 6 Midwest 1006 0 0 0 0 0 0 ## 7 Midwest 1007 0 0 0 4 0 0 ## 8 South 1008 0 0 0 0 0 0 ## 9 West 1009 0 0 0 0 1 0 ## 10 Northeast 1010 0 0 0 0 0 0 ## # ℹ 10,521 more rows ## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>, ## # oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>
When column names are really values, the data frame ends up being wider than it should be. Use pivot_longer() to pivot the data frame longer by turning column names into values.
crm_long <- crm_data |>
select(region, customer_id, contains("2018")) |>
pivot_longer(
-c(region, customer_id),
names_to = "month_year",
values_to = "transactions"
)
Note how much longer the data frame is and why.
crm_long
## # A tibble: 126,372 × 4 ## region customer_id month_year transactions ## <chr> <dbl> <chr> <dbl> ## 1 South 1001 jan_2018 1 ## 2 South 1001 feb_2018 0 ## 3 South 1001 mar_2018 0 ## 4 South 1001 apr_2018 0 ## 5 South 1001 may_2018 0 ## 6 South 1001 jun_2018 4 ## 7 South 1001 jul_2018 0 ## 8 South 1001 aug_2018 0 ## 9 South 1001 sep_2018 0 ## 10 South 1001 oct_2018 0 ## # ℹ 126,362 more rows
Now summarizing the transactions for 2018 by region is trivial.
crm_long |>
group_by(region) |>
summarize(
total_transactions = sum(transactions),
avg_transactions = mean(transactions)
)
## # A tibble: 4 × 3 ## region total_transactions avg_transactions ## <chr> <dbl> <dbl> ## 1 Midwest 10092 0.764 ## 2 Northeast 29958 0.774 ## 3 South 10501 0.788 ## 4 West 47375 0.775
If the data has the opposite problem and has values that should really be column names, use pivot_wider() to pivot the data frame wider by turning values into column names.
crm_long |>
pivot_wider(
names_from = month_year,
values_from = transactions
)
## # A tibble: 10,531 × 14 ## region customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 South 1001 1 0 0 0 0 4 ## 2 West 1002 0 0 0 0 0 0 ## 3 South 1003 4 0 0 0 0 0 ## 4 Midwest 1004 0 0 0 2 0 0 ## 5 West 1005 0 1 0 0 0 0 ## 6 Midwest 1006 0 0 0 0 0 0 ## 7 Midwest 1007 0 0 0 4 0 0 ## 8 South 1008 0 0 0 0 0 0 ## 9 West 1009 0 0 0 0 1 0 ## 10 Northeast 1010 0 0 0 0 0 0 ## # ℹ 10,521 more rows ## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>, ## # oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>
If two (or more) values are in one column, separate() the values into two (or more) columns.
crm_long
## # A tibble: 126,372 × 4 ## region customer_id month_year transactions ## <chr> <dbl> <chr> <dbl> ## 1 South 1001 jan_2018 1 ## 2 South 1001 feb_2018 0 ## 3 South 1001 mar_2018 0 ## 4 South 1001 apr_2018 0 ## 5 South 1001 may_2018 0 ## 6 South 1001 jun_2018 4 ## 7 South 1001 jul_2018 0 ## 8 South 1001 aug_2018 0 ## 9 South 1001 sep_2018 0 ## 10 South 1001 oct_2018 0 ## # ℹ 126,362 more rows
crm_long <- crm_long |>
separate(month_year, c("month", "year"), sep = "_")
crm_long
## # A tibble: 126,372 × 5 ## region customer_id month year transactions ## <chr> <dbl> <chr> <chr> <dbl> ## 1 South 1001 jan 2018 1 ## 2 South 1001 feb 2018 0 ## 3 South 1001 mar 2018 0 ## 4 South 1001 apr 2018 0 ## 5 South 1001 may 2018 0 ## 6 South 1001 jun 2018 4 ## 7 South 1001 jul 2018 0 ## 8 South 1001 aug 2018 0 ## 9 South 1001 sep 2018 0 ## 10 South 1001 oct 2018 0 ## # ℹ 126,362 more rows
Now we can summarize the transactions for 2018 by month and region.
crm_long |>
group_by(month, region) |>
summarize(
total_transactions = sum(transactions),
avg_transactions = mean(transactions)
) |>
arrange(desc(avg_transactions))
## `summarise()` has grouped output by 'month'. You can override using the ## `.groups` argument.
## # A tibble: 48 × 4 ## # Groups: month [12] ## month region total_transactions avg_transactions ## <chr> <chr> <dbl> <dbl> ## 1 nov South 1812 1.63 ## 2 dec Midwest 1723 1.56 ## 3 dec West 7922 1.55 ## 4 dec Northeast 4928 1.53 ## 5 dec South 1693 1.52 ## 6 nov West 7656 1.50 ## 7 nov Northeast 4824 1.50 ## 8 nov Midwest 1646 1.50 ## 9 apr Midwest 751 0.682 ## 10 may South 749 0.674 ## # ℹ 38 more rows
When two (or more) values should be in one column, unite() the values into one column.
crm_long |>
unite("month_year", c(month, year), sep = "_")
## # A tibble: 126,372 × 4 ## region customer_id month_year transactions ## <chr> <dbl> <chr> <dbl> ## 1 South 1001 jan_2018 1 ## 2 South 1001 feb_2018 0 ## 3 South 1001 mar_2018 0 ## 4 South 1001 apr_2018 0 ## 5 South 1001 may_2018 0 ## 6 South 1001 jun_2018 4 ## 7 South 1001 jul_2018 0 ## 8 South 1001 aug_2018 0 ## 9 South 1001 sep_2018 0 ## 10 South 1001 oct_2018 0 ## # ℹ 126,362 more rows
Of all the {tidyr} functions, unite() might seem the least useful. However, we really would like month_year to be an actual date. If we can add a day, we can use {lubridate} to create a date.
crm_long <- crm_long |>
mutate(day = 1) |>
unite("date", c(day, month, year), sep = "-") |>
mutate(date = dmy(date))
crm_long
## # A tibble: 126,372 × 4 ## region customer_id date transactions ## <chr> <dbl> <date> <dbl> ## 1 South 1001 2018-01-01 1 ## 2 South 1001 2018-02-01 0 ## 3 South 1001 2018-03-01 0 ## 4 South 1001 2018-04-01 0 ## 5 South 1001 2018-05-01 0 ## 6 South 1001 2018-06-01 4 ## 7 South 1001 2018-07-01 0 ## 8 South 1001 2018-08-01 0 ## 9 South 1001 2018-09-01 0 ## 10 South 1001 2018-10-01 0 ## # ℹ 126,362 more rows
With a long data frame and a date column, we can plot a time series of transactions.
crm_long |>
group_by(date, region) |>
summarize(
total_transactions = sum(transactions),
avg_transactions = mean(transactions)
) |>
ggplot(aes(x = date, y = avg_transactions, color = region)) +
geom_line() +
scale_x_date(date_breaks = "month", date_labels = "%b")
## `summarise()` has grouped output by 'date'. You can override using the ## `.groups` argument.
We’ve been using data frames (technically tibbles, a modern data frame). A data frame is composed of columns called vectors. Both data frames and vectors are classes of data.
Each vector has a single data type. We’ve discussed double (i.e., numeric), integer, date, character, and factor. If we try to mix data types in a vector, it will pick the easiest one to satisfy.
vector_example <- c(1, 2, "three") vector_example
## [1] "1" "2" "three"
Data frames are nice to work with because each vector can be of a different data type.
tibble(id = 1:3, state = "AZ")
## # A tibble: 3 × 2 ## id state ## <int> <chr> ## 1 1 AZ ## 2 2 AZ ## 3 3 AZ
Sometimes we need to coerce a data class or type.
as_tibble(vector_example)
## # A tibble: 3 × 1 ## value ## <chr> ## 1 1 ## 2 2 ## 3 three
Why would we want to coerce a data class?
We can similarly coerce data types with as.*() functions (e.g., as.numeric() and as.character()).
Coercing dates can be tricky.
We often want to coerce factors using the fct_*() functions.
Why would we want to coerce a data type, especially factors?
crm_data |> mutate(region = fct_infreq(region)) |> ggplot(aes(x = region)) + geom_bar()
Note that geom_bar() is a wrapper for both count() and geom_col() (i.e., only a single variable is needed and the count is performed as part of the plot).
We’ve already used fct_reorder() to coerce a factor ordered by another variable. Now we’ve used fct_infreq() to coerce a factor ordered by frequency and, if you like, fct_rev() to reverse that order.
crm_data |> mutate(region = region |> fct_infreq() |> fct_rev()) |> ggplot(aes(x = region)) + geom_bar()
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
Customers are often analyzed based on those who have made recent purchases, frequent purchases, and spent the most. This is known as a recency, frequency, monetary (RFM) analysis. Now that we can tidy data, we can analyze customers based on recent purchases and frequent purchases. As you work, remember to sketch out what you’d like the data to look like. In RStudio on Posit Cloud, create a new Quarto document and do the following.